Processor for comparing picture element blocks (block matching processor)

ABSTRACT

A processor is disclosed, in which a block memory (ADM), a search domain memory (SDM), a two-dimensional processor/register cell field (PRA) and a control unit (CTRL) are preferably monolithically integrated in a semiconductor chip. The word width of the search domain memory (SDM) is organised so that the processor/register cell field (PRA) is supplied, in parallel per system cycle (CLK), with data (SF) on picture elements of a new complete column of the search domain. At the same time, a control sequence is stored in the control unit (CTRL). The control sequence supplies data flow control signals (DFC) and addresses (ADR1, ADR2) to the block memory and search domain memory in parallel, per system cycle. The control unit essentially consists of a shift register into which external control signals (CD) of any desired control sequences may be written. An essential advantage of the invention is that it allows a comparatively high hardware utilisation even in the case of block matching algorithms based on an incomplete search.

BACKGROUND OF THE INVENTION

Processors of this type are utilized in many applications in the field of motion estimation such as, for example, in hybrid coding for video compression or in motion-compensated interpolation. A flexible solution is desirable in order to support not only different methods but different parameter combinations as well. Examples of this include underscanning in the shift field, investigation of what are referred to as candidate vectors and calculations with sub-pixel precision.

All-purpose digital signal processors or flexible video processors are usually designed neither for the required calculating performance nor for the required I/O bandwidths.

The high calculating performance required can be produced with two-dimensional cell fields. Since, however, external data can only be supplied over the cell field edge in this case, the high calculating performance that is available can usually only be incompletely exploited, which as a rule results in considerable usage losses.

Up to now, high usage factors of, for example, up to 100% of said two-dimensional cell fields were achieved only with dedicated implementations of block-matching algorithms based on a complete search.

The publication IEEE Transactions on Circuits and Systems, Vol. 36, No. 10, October 1989, pages 1309 through 1316, explains in greater detail a parameterizable VLSI architecture for a block-matching algorithm that is based on a complete search.

Given an incomplete search, i.e. when not all possible shift vectors within a respective search region are investigated, only a sub-set of the calculated results is required. Although a block-matching algorithm based on an incomplete search can be realized by a flexible selection of the relevant results or of the processor elements to be considered in the cell field, this occurs at the expense of substantial losses in the effective usage of the processor circuit.

European Patent Application 0 395 293 A1 (corresponding to U.S. Pat. No. 5,206,723) discloses a motion estimating means with comparison processors wherein, among other things, a minimum shift vector is calculated.

SUMMARY OF THE INVENTION

The object underlying the invention is to specify a processor for comparing picture element blocks (block matching processor) with a two-dimensional cell field that also offers an optimally high hardware usage given block matching algorithms that are based on an incomplete search.

In general terms the present invention is a processor for comparing picture element blocks (block matching processor), whereby a block memory for data of two current picture element blocks, a search domain memory for data of picture elements of a part of a comparison image limited by two horizontally neighboring search domains, a respective search domain being composed of rows and columns, a two-dimensional processor/register cell field and a control unit are provided. The search domain memory is organized with respect to its word width such that the processor/register cell field is respectively supplied in parallel with data of picture elements of a complete column of the respective search domain per system clock. An amount is formed from a respective difference and the amounts are summed up. The search domain memory contains data of two search domains. The two search domains horizontally overlap one another in order to shorten a reloading of the processor/register cell field given a change of search domain. A control sequence is stored in the control unit that, per system clock, supplies parallel data flow control signals, an address for addressing the block memory, and a further address for addressing the respective search domain memory. The control unit is programmed by external control data. A shift register clocked by the system clock is provided. A flexible control sequence dependent on the respective comparison method (block matching algorithm) is written thereinto by the control data. The control sequence generates the output signals of the control unit such that only a part (dependent on the respective comparison method) of all fundamentally possible shifts between the current picture element block and blocks of the search domain of the comparison image are compared (incomplete search).

Advantageous developments of the present invention are as follows.

The processor/register cell field is connected to and followed by a unit for determining a minimum wherein a minimum norm is calculated from amount sum norms. A unit for forming a probable shift vector is additionally provided. The probable shift vector is generated from a portion of output signals of the control unit when a respective amount sum norm corresponds to the minimum norm.

Some of the output signals are data flow control signals. The unit for forming the shift vector has a respective counter of the plurality of counters for each component of the shift vector. The counter receives counting pulses by means of the data flow control signals. The counters of the plurality of counters are respectively followed by a hold element into which the respective shift vector at the output of the counter is stored insofar as a respective amount sum norm corresponds to a respective minimum norm. A vector memory that is addressed by a part of the output signals of the control unit is provided in the unit for forming the shift vector. The unit for forming the shift vector contains a holding element for each component of the shift vector and in which the respective shift vector is stored insofar as a respective amount sum norm corresponds to a respective minimum norm.

The unit for forming the shift vector is directly supplied with a portion of the output signals of the control unit, the portion of the output signals corresponding to the respective shift vector itself. The unit for forming the shift vector contains a holding element for each component of the shift vector and in which the respective shift vector is stored insofar as a respective amount sum norm corresponds to a respective minimum norm.

The processor is monolithically integrated on a semiconductor chip.

BRIEF DESCRIPTION OF THE DRAWINGS

The features of the present invention which are believed to be novel are set forth with particularity in the appended claims. The invention, together with further objects and advantages, may best be understood by reference to the following description taken in conjunction with the accompanying drawings, in the several Figures of which like reference numerals identify like elements, and in which:

FIG. 1 is a diagram for explaining an incomplete search;

FIG. 2 is a block circuit diagram of an inventive processor with a two-dimensional cell field;

FIG. 3 is a detailed circuit diagram of a cell field contained in FIG. 2; and

FIG. 4 is a detailed circuit diagram of a processor element contained in FIG. 3.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 shows an example of a shift field with 11*13=143 possible shifts that are shown as points and with a shift 0,0 as mid-point. In contrast to the complete search, not all shifts are required for a block matching algorithm based on an incomplete search; rather, only the shifts 1-9 shown in FIG. 1 as bold-face points are required. With a conventional processor, i.e. given a complete search with subsequent selection of the required shifts, 11*13=143 calculations would be carried out here, even though, for example, only 9 of these are needed.

With an inventive processor, by contrast, a meander-like data flow indicated in FIG. 1 with bold-face, broken-line arrows is possible, whereby the number of required calculations can be reduced from 143 to 34. The effective usage factor of the cell field thus rises from 6% to 26%.
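The usage figures quoted above follow directly from the ratio of required shifts to calculations actually carried out. The following minimal sketch reproduces that arithmetic; the function name `usage_factor` is an illustrative assumption and does not appear in the disclosure.

```python
def usage_factor(required_shifts: int, calculations: int) -> float:
    """Fraction of the performed calculations whose results are actually needed."""
    return required_shifts / calculations

# Complete search: all 11*13 = 143 shifts are calculated, 9 are needed.
complete = usage_factor(9, 11 * 13)
# Meander-like data flow: 34 calculations, the same 9 are needed.
meander = usage_factor(9, 34)

print(f"complete search: {complete:.1%}")
print(f"meander flow:    {meander:.1%}")
```

Rounded down to whole percent, the two ratios correspond to the 6% and 26% stated in the text.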

As explained in greater detail in IEEE Transactions on Circuits and Systems, Vol. 36, No. 10, October 1989, pages 1309 through 1316, processors having a cell field of type 1 or of type 2 are possible, whereby the cell field of type 1 effects a global accumulation and the cell field of type 2 effects a local accumulation.

The inventive processor can fundamentally comprise a cell field of type 1 or, on the other hand, a cell field of type 2 as well, whereby the cell field of type 1 usually offers a better hardware utilization in the case of an incomplete search.

FIG. 2 shows a block circuit diagram of an inventive processor for comparing picture element blocks, comprising a processor/register cell field PRA, a search domain memory SDM, a block memory ADM, a control unit CTRL and, optionally, a unit VU for forming a probable shift vector DV and a minimum-forming unit MIN with a feedback register R1. By way of example, the word widths and memory dimensions for a current picture element block with 16*16 picture elements and a search domain of 48*48 picture elements are recited in FIG. 2. The block memory ADM for data of a current picture element block receives, for example, 8 bits per picture element (corresponding to a gray scale value or a color value) per system clock CLK, either directly from a camera or from an external image memory. The block memory ADM is composed of two identical units, each of which has rows and columns of 16 picture elements each, like the image block, with 8 bits provided per picture element. These units are employed in alternating mode for reading the data into the block memory and for reading the data out into the processor/register cell field PRA. One column with 16*8 bits here can be read out in parallel at the output of the block memory ADM per system clock as current data AD and into the processor/register cell field PRA. A complete column of the current picture element block can be read into the block memory ADM from the external image store in 16 system clocks each.

Since the search domain here comprises, for example, 48 rows and 48 columns and a column of the search domain memory SDM must be loaded at the same time as a column of the block memory ADM, 3*8 bits are supplied to the search domain memory SDM from an external image store M per system clock, so that a column of the search domain having 48*8 bits here can be read into the search domain memory SDM after every 16 system clocks. At the output of the search domain memory SDM, a column having 48*8 bits here can be read in parallel into the processor/register cell field PRA as search domain data. The search domain memory SDM thereby comprises an organization of 48*64*8 bits, since two successive search domains overlap in a region of 48*32 bits and reading is undertaken from a block of 48*48 picture elements and writing is undertaken in a block of 16*48 picture elements. What is critical to the invention is that, once the search domain memory SDM has been completely written with data, any arbitrary shift within a respective search domain is possible within a maximum of 16 system clocks by reloading corresponding columns from the search domain memory into the processor/register cell field. The shifts ensue horizontally only in the direction of the cell field output and vertically both downward and upward.
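The memory dimensions given in this example can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check; all constant names are illustrative, and it assumes that the 48*32 overlap region is to be read as 48 rows by 32 columns of picture elements.

```python
BITS_PER_PIXEL = 8
ROWS = 48                 # rows of a search domain
COLS_PER_DOMAIN = 48      # columns of a search domain
PIXELS_IN_PER_CLOCK = 3   # 3*8 bits written into the SDM per system clock
OVERLAP_COLS = 32         # assumption: overlap of two neighboring search domains

# One search domain column holds 48*8 bits.
column_bits = ROWS * BITS_PER_PIXEL
# Writing it at 3*8 bits per clock takes 16 system clocks.
clocks_per_column = column_bits // (PIXELS_IN_PER_CLOCK * BITS_PER_PIXEL)

# Two overlapping 48-column domains occupy 64 distinct columns.
sdm_columns = 2 * COLS_PER_DOMAIN - OVERLAP_COLS
sdm_bits = ROWS * sdm_columns * BITS_PER_PIXEL

print(clocks_per_column, sdm_columns, sdm_bits)
```

Under these assumptions the 16-clock write time per column and the 48*64*8-bit organization both come out as stated.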

By way of example, FIG. 3 shows a processor/register cell field PRA for a global accumulation in detail. Advantageously, one processor unit is provided per picture element of the current picture element block, i.e. processor units PE1,1 . . . PE16,16 in the example of 16*16 picture elements. Since a search domain column comprises 48 picture elements here, each column of processor units is upwardly supplemented by 16 registers and is downwardly supplemented by 16 registers. From top to bottom, the first column of the processor/register field shown in FIG. 3 is thus composed of 16 register units RE1,1 . . . RE16,1, processor units PE1,1 . . . PE16,1, and of 16 more register units RE1,1' . . . RE16,1'. The other columns of the processor/register cell field are correspondingly organized, whereby the last column is composed of 16 register units RE1,16 . . . RE16,16, processor units PE1,16 . . . PE16,16, and of another 16 register units RE1,16' . . . RE16,16'. Output data SD1 . . . SD16 of the search domain memory SDM, each 8 bits wide, are supplied in parallel to the register units RE1,16 . . . RE16,16, the output data SD17 . . . SD32 of the search domain memory SDM, each 8 bits wide, are supplied in parallel to the processor units PE1,16 . . . PE16,16, and the output data SD33 . . . SD48 of the search domain memory SDM that, for example, are 8 bits wide are supplied in parallel to the register units RE1,16' . . . RE16,16'. For further improvement of the usage, a means for the cyclical permutation of the supplied data is also conceivable in order to avoid pure shift operations in the processor/register field. Over and above this, the processor units PE1,16 . . . PE16,16 are respectively supplied with output data AD1 . . . AD16 of the block memory ADM that, for example, are 8 bits wide. The processor units of the sixteenth column can thereby transfer search domain data and current data for intermediate results to the fifteenth column, etc., before the intermediate results of the first column are transferred to an adder means ADD via intermediate result outputs, for example ZO.

The adder means ADD is thereby composed, for example, of a binary adder tree, i.e. 8 adders here in the first level, 4 in the second level, 2 in the third level and 1 adder in the fourth level. The amount sum norm N is thereby formed at the adder means ADD output from differences between the data for the current picture element block and the data of picture elements of the part of the comparison image limited by the search domain.
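The formation of the amount sum norm N can be illustrated in software. The following is a behavioural sketch only, not a model of the circuit timing: it assumes that each of the 16 rows of the cell field delivers one intermediate result (the sum of absolute differences along that row) and that these 16 values are then combined pairwise, as in the binary adder tree described above. The function names are illustrative.

```python
def adder_tree(values):
    """Sum a power-of-two number of values pairwise.

    Returns the total and the number of adders used in each tree level.
    """
    level = list(values)
    adders_per_level = []
    while len(level) > 1:
        # Each adder combines two neighboring inputs of the current level.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        adders_per_level.append(len(level))
    return level[0], adders_per_level


def amount_sum_norm(block, candidate):
    """Sum of absolute differences between two equally sized pixel blocks."""
    row_results = [
        sum(abs(a - s) for a, s in zip(block_row, cand_row))
        for block_row, cand_row in zip(block, candidate)
    ]
    return adder_tree(row_results)


# Example with a 16*16 block: every pixel differs by exactly 2.
block = [[10] * 16 for _ in range(16)]
candidate = [[12] * 16 for _ in range(16)]
norm, levels = amount_sum_norm(block, candidate)
print(norm, levels)
```

For 16 row results the tree uses 8, 4, 2 and 1 adders in its four levels, matching the structure named in the text.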

Analogous to the processor units, the register units of the sixteenth column RE1,16 . . . RE16,16' can transfer data to register units of the fifteenth column, etc., all the way to the register units RE1,1 . . . RE16,1'. A bidirectional data transport is possible between the register units and processor units or, respectively, register units and register units, whereby data are respectively forwarded by one row during a system clock.

FIG. 4 shows a detailed circuit diagram of a processor unit PE, whereby the processor unit PE itself also contains a register unit RE. A register unit RE is composed of a multiplexer MUX and of a following register R2 whose output simultaneously supplies search domain data output signals SDO_(i) of the register unit RE or, respectively, of the processor unit PE. Depending on a data flow control signal DFC that is 2 bits wide, search domain data input signals SDI_(i+1) and SDI_(i-1) of the neighboring rows as well as a search domain data input signal SDI_(i) of a neighboring column preceding in data flow direction are optionally through-connected onto the input of the register R2. The output of the register R2 supplies search domain data output signals SDO_(i+1) and SDO for register units in neighboring cells as well as a search domain data output signal SDO_(i) for a register unit of a column of the cell field that follows in data flow direction. An input signal ADI for data of a current picture element block is forwarded in the processor unit PE via a register R3 to the respective output, whereby an output signal ADO for data of a current picture element block is present at the output of the processor unit PE.

In addition to containing the register unit RE, the processor unit PE contains a switch SW, a register R4, a subtraction/amount-forming unit B and a summation unit A. During the processing of a block, the switch SW is opened and the data of a current picture element block stored in the buffer register R4 are present at the plus input of the subtraction/amount-forming unit B. Only when a new block is loaded is the switch SW closed and the buffer register R4 loaded with data of a new, current picture element block. The buffer register R4 serves mainly for decoupling and may not be required, depending on the circuit-oriented implementation of the processor unit. The minus input of the subtraction/amount-forming unit B is supplied with the signal SDO_(i), and the output of the subtraction/amount-forming unit B is connected to an input of the summation unit A, whose second input is supplied with an input signal for an intermediate result of a column of the cell field that precedes in data flow direction, and whose output supplies an output signal ZO for a column following in data flow direction.
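The arithmetic part of a processor unit can be sketched behaviourally as follows. This is a simplified illustration under stated assumptions: pipeline timing and the registers R2 and R3 are ignored, and the class, method and parameter names (`ProcessorElement`, `load`, `step`, `zi`) are hypothetical labels for the switch SW, the buffer register R4, the unit B and the unit A.

```python
class ProcessorElement:
    """Behavioural sketch of units SW, R4, B and A of one processor unit."""

    def __init__(self):
        self.r4 = 0  # buffer register R4: pixel of the current block

    def load(self, ad, switch_closed):
        """Load R4 with block data only while switch SW is closed."""
        if switch_closed:
            self.r4 = ad

    def step(self, sdo, zi):
        """Return ZO = ZI + |R4 - SDO|.

        R4 feeds the plus input and SDO the minus input of unit B;
        unit A adds the amount to the incoming intermediate result ZI.
        """
        return zi + abs(self.r4 - sdo)


# Chain three processor units of one row: each passes its result onward.
row = [ProcessorElement() for _ in range(3)]
for pe, pixel in zip(row, [10, 20, 30]):
    pe.load(pixel, switch_closed=True)

# With SW open, new block data must not overwrite R4.
row[0].load(99, switch_closed=False)

z = 0
for pe, search_pixel in zip(row, [12, 15, 30]):
    z = pe.step(search_pixel, z)
print(z)  # |10-12| + |20-15| + |30-30|
```

Keeping the block pixel in R4 while only the search data move is what allows many shifts to be evaluated without reloading the current block.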

A processor/register cell field for a local accumulation is constructed similarly to that shown in FIG. 3. The critical differences are that no intermediate results are forwarded between processor units but are instead further processed in the processor unit itself and that, via a further multiplexer, either the norm at the output of the summation unit A of the respective processor element or norms of processor elements preceding in data flow direction proceed as norm N to the output of the cell field PRA.

The control unit CTRL is advantageously composed of a shift register that is clocked by the system clock CLK and whose content is freely programmed by external control data CD. As an alternative to the shift register or to other, comparably organized write/read memories, a control unit on the basis of a read-only memory is also fundamentally conceivable. The control unit must be of such a nature that a control sequence can be stored therein that, per system clock CLK, supplies parallel data flow control signals DFC, an address ADR1 for addressing the block memory and a further address ADR2 for addressing the search domain memory.
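A shift-register based control unit of this kind can be modelled in a few lines. The sketch below is an assumption-laden illustration: the text only states that the shift register is clocked by CLK and programmed by control data CD, so the recirculation of the stored sequence, the word layout and all names (`ControlWord`, `ControlUnit`, `write`, `clock`) are hypothetical.

```python
from collections import deque, namedtuple

# One control word per system clock: DFC plus the two memory addresses.
ControlWord = namedtuple("ControlWord", ["dfc", "adr1", "adr2"])


class ControlUnit:
    """Sketch of a programmable shift register holding a control sequence."""

    def __init__(self, length):
        idle = ControlWord(0, 0, 0)
        self.register = deque([idle] * length, maxlen=length)

    def write(self, word):
        """Shift one word of external control data CD into the register."""
        self.register.append(word)

    def clock(self):
        """Output the current control word and advance by one position.

        Assumption: the sequence recirculates, so it repeats after
        `length` clocks without being reprogrammed.
        """
        word = self.register[0]
        self.register.rotate(-1)
        return word


ctrl = ControlUnit(length=3)
program = [ControlWord(1, 0, 16), ControlWord(3, 1, 17), ControlWord(2, 2, 18)]
for w in program:
    ctrl.write(w)

outputs = [ctrl.clock() for _ in range(4)]
print(outputs)
```

Because the sequence is plain data, a different block matching algorithm only requires a different list of control words, not a different circuit.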

When not only amount sum norms N but also a probable shift vector DV are to be calculated in the processor, then the unit MIN for determining a minimum as well as the unit VU for forming a probable shift vector DV are additionally provided in the processor. The output signal of the unit MIN is thereby returned via a register R1 onto an input of the unit MIN, and the further input of the unit MIN is supplied with the amount sum norms N of the processor/register cell field.

Depending on the embodiment of the unit VU, the probable shift vector DV can be formed in the respective unit VU either with the assistance of the data flow control signals DFC of the control unit CTRL or, indicated with broken lines in FIG. 2, with the assistance of output signals VD of the control unit additionally stored in the control unit. The additional output signals VD can thereby be composed of vector data themselves or, on the other hand, of addresses of vector data.

When only the data flow control signals DFC are utilized for forming the shift vector DV, then the unit VU contains a counter for each component of the probable shift vector DV that can receive positive or negative counting pulses by means of the data flow control signals DFC and that is respectively followed by a hold element in which the respective shift vector present at the output of the counter is stored, insofar as a respective amount sum norm N corresponds to a respective minimum norm NMIN.
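The counter-based variant of the unit VU, together with the minimum-forming unit MIN, can be sketched as follows. Several details are assumptions made for illustration and are not taken from the disclosure: the encoding of the 2-bit signal DFC, the use of `None` to mark clocks without a valid norm, and the rule that ties keep the earlier vector. All names are hypothetical.

```python
# Hypothetical encoding of the 2-bit data flow control signal DFC.
HOLD, HORIZONTAL, UP, DOWN = 0, 1, 2, 3


def track_shift_vector(steps, start=(0, 0)):
    """Follow DFC with two counters and latch the vector of the minimum norm.

    `steps` is a sequence of (dfc, norm) pairs, one per system clock;
    norm is None on clocks where no valid amount sum norm is present.
    Returns (held_vector, minimum_norm).
    """
    dx, dy = start            # one counter per vector component
    nmin = None               # content of feedback register R1
    held = None               # content of the hold elements
    for dfc, norm in steps:
        # Counting pulses derived from the data flow control signal.
        if dfc == HORIZONTAL:
            dx += 1
        elif dfc == UP:
            dy -= 1
        elif dfc == DOWN:
            dy += 1
        # Latch the counters when the new norm becomes the minimum.
        if norm is not None and (nmin is None or norm < nmin):
            nmin = norm
            held = (dx, dy)
    return held, nmin


steps = [
    (HORIZONTAL, 50),   # vector (1, 0), norm 50: first minimum
    (DOWN, None),       # vector (1, 1), no valid norm on this clock
    (DOWN, 20),         # vector (1, 2), norm 20: new minimum
    (HORIZONTAL, 20),   # vector (2, 2), tie: earlier vector is kept
    (UP, 35),           # vector (2, 1), norm 35: not a minimum
]
vector, minimum = track_shift_vector(steps)
print(vector, minimum)
```

The advantage of this variant is that no vector data need to be stored in the control unit; the shift vector is reconstructed from the same signals that steer the data flow.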

When the additional output signals VD of the control unit CTRL are composed of vector addresses, then it is not a counter but a vector memory that is provided in the unit VU, this vector memory being addressed by the vector addresses. Further, a hold element in which the respective component of the shift vector is stored insofar as a respective amount sum norm N corresponds to a respective minimum norm NMIN is provided in the unit VU for each component of the shift vector.

When the additional output signal VD of the control unit is composed of the respective shift vector itself, then the respective shift vector is transferred into hold elements as probable shift vector insofar as a respective amount sum norm N corresponds to a respective minimum norm NMIN.

An advantageous development of the inventive processor is that the processor is monolithically integrated on a semiconductor chip together with the memories ADM and SDM.

The invention is not limited to the particular details of the apparatus depicted and other modifications and applications are contemplated. Certain other changes may be made in the above described apparatus without departing from the true spirit and scope of the invention herein involved. It is intended, therefore, that the subject matter in the above depiction shall be interpreted as illustrative and not in a limiting sense.

What is claimed is:
 1. A block matching processor for comparing picture element blocks, comprising: a block memory for data of two current picture element blocks, a search domain memory for data of picture elements of a part of a comparison image limited by two horizontally neighboring search domains, a respective search domain being composed of rows and columns, a two-dimensional processor/register cell field connected to the block memory and to the search domain memory, and a control unit connected to the block memory, to the search domain memory, and to the processor/register cell field; the search domain memory organized with respect to a word width thereof such that the processor/register cell field is respectively supplied in parallel with data of picture elements of a complete column of the respective search domain per system clock, a respective amount being formed from a respective difference and respective amounts being summed up; the search domain memory having data of two search domains, the two search domains horizontally overlapping one another in order to shorten a reloading of the processor/register cell field given a change of search domain; a control sequence stored in the control unit that, per system clock, supplies parallel data flow control signals, a first address for addressing the block memory, a second address for addressing the respective search domain in the search domain memory; the control unit being programmed by external control data; and a shift register clocked by the system clock, a flexible control sequence dependent on a respective comparison method being written thereinto according to the control data, said control sequence generating the output signals of the control unit such that only a part which is dependent on the respective comparison method of all possible shifts between the current picture element block and blocks of the search domain of the comparison image are compared.
 2. The processor according to claim 1, wherein the processor/register cell field is connected to and followed by a unit for determining a minimum wherein a minimum norm is calculated from amount sum norms; wherein a unit for forming a probable shift vector is additionally provided; wherein the probable shift vector is generated from a portion of output signals of the control unit when a respective amount sum norm corresponds to the minimum norm.
 3. The processor according to claim 2, wherein some of the output signals are data flow control signals and wherein a plurality of counters are provided; wherein the unit for forming the shift vector has a respective counter of the plurality of counters for each component of the shift vector, said counter receiving counting pulses by means of the data flow control signals; and the counters of the plurality of counters respectively followed by a hold element into which the respective shift vector at the output of the counter is stored insofar as a respective amount sum norm corresponds to a respective minimum norm.
 4. The processor according to claim 2, wherein a vector memory that is addressed by a part of the output signals of the control unit is provided in the unit for forming the shift vector; and wherein the unit for forming the shift vector contains a holding element for each component of the shift vector and in which the respective shift vector is stored insofar as a respective amount sum norm corresponds to a respective minimum norm.
 5. The processor according to claim 2, wherein the unit for forming the shift vector is directly supplied with a portion of the output signals of the control unit, the portion of the output signals corresponding to the respective shift vector itself; and wherein the unit for forming the shift vector contains a holding element for each component of the shift vector and in which the respective shift vector is stored insofar as a respective amount sum norm corresponds to a respective minimum norm.
 6. The processor according to claim 1 wherein the processor is monolithically integrated on a semiconductor chip.